HW5 Michal Grotkowski¶
Task 1.¶
So we get: $g_1^{PD}(z) = z^2 + \dfrac 13 $
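The task statement is not reproduced here, but the result $z^2 + \frac13$ is consistent with a model of the form $g(z_1, z_2) = z_1^2 + z_2^2$ with $Z_2 \sim U(0,1)$, since $\mathbb{E}[Z_2^2] = \frac13$. Under that assumption (hypothetical; only the final formula appears above), a quick Monte Carlo sanity check of the Partial Dependence profile:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed model and feature distribution (not stated above, inferred
# from the result): g(z1, z2) = z1^2 + z2^2, with Z2 ~ Uniform(0, 1).
g = lambda z1, z2: z1**2 + z2**2

z2_samples = rng.uniform(0.0, 1.0, size=200_000)

def pd_profile(z1):
    """Monte Carlo estimate of g_1^PD(z1) = E_{Z2}[ g(z1, Z2) ]."""
    return g(z1, z2_samples).mean()

for z in [0.0, 0.5, 1.0]:
    print(f"g_1^PD({z}) ~ {pd_profile(z):.4f}  (formula: {z**2 + 1/3:.4f})")
```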
Task 2.¶
For this task I trained an XGBoost model on the Adult Income dataset. We will study the following two observations:
X_test.iloc[[100,2]]
| age | fnlwgt | educational-num | capital-gain | capital-loss | hours-per-week | workclass_Federal-gov | workclass_Local-gov | workclass_Never-worked | workclass_Private | ... | native-country_Portugal | native-country_Puerto-Rico | native-country_Scotland | native-country_South | native-country_Taiwan | native-country_Thailand | native-country_Trinadad&Tobago | native-country_United-States | native-country_Vietnam | native-country_Yugoslavia | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 784 | 49 | 139571 | 9 | 4064 | 0 | 40 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 524 | 53 | 139671 | 10 | 0 | 0 | 50 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2 rows × 100 columns
Some important differences between these two observations are that one person works in the private sector while the other is a government worker. Below are the predictions of the XGBoost model, with $1$ meaning that the person's income exceeds 50K USD.
print("predicted classes: %s and %s" % (model.predict(X_test.iloc[[100]])[0],model.predict(X_test.iloc[[2]])[0]))
print("true classes %s and %s" % (target_test.iloc[100], target_test.iloc[2]))
predicted classes: 0 and 1
true classes 0 and 0
Ceteris Paribus profiles for age and capital-loss for the respective observations are shown below. For the private-sector worker the age profile is mostly flat, while the government worker's profile changes much more. This may indicate that age plays less of a role in the model assigning a higher income to private-sector workers.
The capital-loss curves also differ: for the private-sector worker the influence of capital-loss is either neutral or positive towards the high-income class, while for the government worker a higher value can also have a negative influence. This is surprising, since we would expect capital-loss to be inversely correlated with higher income, although private-sector workers may take more financial risk. The fluctuating pattern may also be an out-of-distribution artifact, since the vast majority of capital-loss values in the dataset are near zero.
cp.plot(variables = ["age", "capital-loss"], title = "Private sector worker")
cp2.plot(variables = ["age", "capital-loss"], title = "Government worker")
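The `predict_profile` calls above come from dalex, but the underlying Ceteris Paribus computation is simple: take one observation, sweep a single feature over a grid while holding everything else fixed, and record the model's predictions. A minimal sketch with a toy model (the function and variable names here are illustrative, not dalex internals):

```python
import numpy as np

def ceteris_paribus(predict, x, feature_idx, grid):
    """CP profile for one observation: vary feature `feature_idx`
    over `grid`, hold all other features fixed, collect predictions."""
    rows = np.tile(x, (len(grid), 1))   # one copy of x per grid value
    rows[:, feature_idx] = grid         # overwrite the swept feature
    return predict(rows)

# Toy "model": predicted probability depends only on feature 0.
predict = lambda X: 1.0 / (1.0 + np.exp(-X[:, 0]))

x = np.array([0.7, 3.0])                # a single observation
grid = np.linspace(-2.0, 2.0, 5)
profile = ceteris_paribus(predict, x, 0, grid)
print(profile)                          # an increasing logistic curve
```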
I have also trained a Logistic Regression model; below are the Partial Dependence profiles for age. For logistic regression, PD profiles always form a logistic curve, with the rate of its growth reflecting the coefficient associated with age. We can see that, on average, for both models an increase in age raises the chance of the model assigning the high-income class. The relationship is monotonic for logistic regression, while for XGBoost there is a large bump between the ages of 70 and 80. This may indicate a higher number of observations in the training set in this age bracket that do in fact earn more.
pdp.plot(variables=["age"], geom="profiles", title="Partial Dependence Plot - XGBoost ")
pdp2.plot(variables=["age"], geom="profiles", title="Partial Dependence Plot - Logistic Regression")
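A Partial Dependence profile is just the average of the Ceteris Paribus profiles over the dataset. Continuing the toy sketch (illustrative names, not the dalex implementation):

```python
import numpy as np

def partial_dependence(predict, X, feature_idx, grid):
    """PD profile: for each grid value v, set the feature to v for
    every observation and average the model's predictions."""
    pd_vals = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature_idx] = v          # freeze the feature at v everywhere
        pd_vals.append(predict(Xv).mean())
    return np.array(pd_vals)

# Toy model: logistic in feature 0, ignores feature 1.
predict = lambda X: 1.0 / (1.0 + np.exp(-X[:, 0]))

X = np.array([[0.7, 3.0], [-1.2, 0.5], [2.0, 1.0]])
grid = np.linspace(-2.0, 2.0, 5)
print(partial_dependence(predict, X, 0, grid))
```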
Appendix¶
import plotly.io as pio
pio.renderers.default = "jupyterlab"
from google.colab import files
files.upload()
!unzip archive.zip
import pandas as pd
df = pd.read_csv("adult.csv")
df["income"] = pd.get_dummies(df["income"], drop_first = True)
cols=df.select_dtypes(exclude=['int', 'uint8']).columns.to_list()
df[cols] = df[cols].astype("category")
X = df.iloc[:, :-1]
target = df["income"]
X.head()
| age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States |
| 4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States |
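The target encoding above relies on `pd.get_dummies(..., drop_first=True)` collapsing a two-category column into a single 0/1 indicator. A toy illustration with the Adult dataset's income labels:

```python
import pandas as pd

income = pd.Series(["<=50K", ">50K", "<=50K", ">50K"])

# drop_first=True drops the "<=50K" column (the first category),
# leaving one indicator that is 1 exactly when income is ">50K".
encoded = pd.get_dummies(income, drop_first=True)
print(encoded)
```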
from sklearn.model_selection import train_test_split
X_train, X_test, target_train, target_test = train_test_split(pd.get_dummies(X, drop_first = True), target, random_state = 42, train_size = 0.8, stratify = target)
import xgboost as xgb
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
linear_model = make_pipeline(StandardScaler(), LogisticRegression())
model = xgb.XGBClassifier()
model.fit(X_train, target_train)
linear_model.fit(X_train, target_train)
print(accuracy_score(target_test, model.predict(X_test)),accuracy_score(target_test, linear_model.predict(X_test)))
0.8754222540689938 0.8537209540382844
!pip install dalex
import dalex as dx
def predict_func(model, df):
return model.predict_proba(df)[:, 1]
explainer = dx.Explainer(model, X_test, target_test, predict_function = predict_func)
explainer.model_performance()
Preparation of a new explainer is initiated
  -> data              : 9769 rows 100 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 9769 values
  -> model_class       : xgboost.sklearn.XGBClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function predict_func at 0x7fde2ee6cb80> will be used
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 1.44e-05, mean = 0.237, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.986, mean = 0.00211, max = 1.0
  -> model_info        : package xgboost
A new explainer has been created!
| recall | precision | f1 | accuracy | auc | |
|---|---|---|---|---|---|
| XGBClassifier | 0.65355 | 0.789664 | 0.715188 | 0.875422 | 0.929625 |
for col in X_test.columns:
print(col)
age fnlwgt educational-num capital-gain capital-loss hours-per-week workclass_Federal-gov workclass_Local-gov workclass_Never-worked workclass_Private workclass_Self-emp-inc workclass_Self-emp-not-inc workclass_State-gov workclass_Without-pay education_11th education_12th education_1st-4th education_5th-6th education_7th-8th education_9th education_Assoc-acdm education_Assoc-voc education_Bachelors education_Doctorate education_HS-grad education_Masters education_Preschool education_Prof-school education_Some-college marital-status_Married-AF-spouse marital-status_Married-civ-spouse marital-status_Married-spouse-absent marital-status_Never-married marital-status_Separated marital-status_Widowed occupation_Adm-clerical occupation_Armed-Forces occupation_Craft-repair occupation_Exec-managerial occupation_Farming-fishing occupation_Handlers-cleaners occupation_Machine-op-inspct occupation_Other-service occupation_Priv-house-serv occupation_Prof-specialty occupation_Protective-serv occupation_Sales occupation_Tech-support occupation_Transport-moving relationship_Not-in-family relationship_Other-relative relationship_Own-child relationship_Unmarried relationship_Wife race_Asian-Pac-Islander race_Black race_Other race_White gender_Male native-country_Cambodia native-country_Canada native-country_China native-country_Columbia native-country_Cuba native-country_Dominican-Republic native-country_Ecuador native-country_El-Salvador native-country_England native-country_France native-country_Germany native-country_Greece native-country_Guatemala native-country_Haiti native-country_Holand-Netherlands native-country_Honduras native-country_Hong native-country_Hungary native-country_India native-country_Iran native-country_Ireland native-country_Italy native-country_Jamaica native-country_Japan native-country_Laos native-country_Mexico native-country_Nicaragua native-country_Outlying-US(Guam-USVI-etc) native-country_Peru native-country_Philippines native-country_Poland 
native-country_Portugal native-country_Puerto-Rico native-country_Scotland native-country_South native-country_Taiwan native-country_Thailand native-country_Trinadad&Tobago native-country_United-States native-country_Vietnam native-country_Yugoslavia
cp = explainer.predict_profile(new_observation = X_test.iloc[[100]])
cp.plot(variables = ["age", "capital-loss"])
cp2 = explainer.predict_profile(new_observation = X_test.iloc[[2]])
cp2.plot(variables = ["age", "capital-loss"])
X_test.iloc[[100, 2]]
| age | fnlwgt | educational-num | capital-gain | capital-loss | hours-per-week | workclass_Federal-gov | workclass_Local-gov | workclass_Never-worked | workclass_Private | ... | native-country_Portugal | native-country_Puerto-Rico | native-country_Scotland | native-country_South | native-country_Taiwan | native-country_Thailand | native-country_Trinadad&Tobago | native-country_United-States | native-country_Vietnam | native-country_Yugoslavia | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 784 | 49 | 139571 | 9 | 4064 | 0 | 40 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 524 | 53 | 139671 | 10 | 0 | 0 | 50 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2 rows × 100 columns
pdp = explainer.model_profile(grid_points = 50)
pdp.plot(variables=["age", ], geom="profiles", title="Partial Dependence Plot with individual profiles")